1. Introduction

There are several ideas being used today for Web information retrieval, and specifically in Web
search engines [20]. The PageRank algorithm [23] is one of those that introduce a content-neutral
ranking function over Web pages. This ranking is applied to the set of pages returned by the Google
search engine in response to posting a search query. PageRank is based in part on two simple
common sense concepts: (i) A page is important if many important pages include links to it. (ii) A
page containing many links has reduced impact on the importance of the pages it links to.

In this paper we focus on asynchronous iterative schemes [9, 15] to compute PageRank over large
sets of Web pages. The elimination of the synchronizing phases is expected to be advantageous
on heterogeneous platforms. The motivation for a possible move to such large scale distributed
platforms lies in the size of matrices representing Web structure. In orders of magnitude: 10^^ pages
with lO'^ nonzero elements and 10^^ bytes just to store a small percentage of the Web (the part already
crawled); distributed memory machines are necessary for such computations. The present research
is part of our general objective, to explore the potential of asynchronous computational models as an
underlying framework for very large scale computations over the Grid [14]. The area of "internet
algorithmics" appears to offer many occasions for computations of unprecedented dimensionality that
would be good candidates for this framework.

After giving a formulation of PageRank and its common interpretations in Section 2, we present its
treatment under synchronous computational models. We next consider the asynchronous approach
and comment on key aspects, specifically convergence, termination detection and implementation. In
Section 5 we describe the experimental framework and present preliminary numerical experiments,
while in Section 6 we draw our conclusions and discuss our future work on this topic. In this paper,
as is common practice, we do not address the effects of finite precision arithmetic and roundoff error.

2. Formulation and Interpretations 

In order to appreciate the PageRank computation, we present its standard formulation using the
following set of four $n \times n$ matrices, where $n$ is the number of pages being modeled.

An adjacency matrix $A$ can be obtained through a web crawl or synthetically generated using
statistical results, e.g., as in [10]. Thus, $A_{ij} = 1$ iff page $i$ points to page $j$, and $A_{ij} = 0$ otherwise.

A transition matrix $P$ has nonzero elements $P_{ij} = A_{ij}/\deg(i)$ when $\deg(i) \neq 0$, and zero otherwise
(in which case page $i$ is called a dangling page); here $\deg(i) = \sum_j A_{ij}$ is the outdegree of page $i$.

A stochastic matrix $S$ is given by $S = P^T + w d^T$; $w = \frac{1}{n} e$, where $e$ is the size $n$ vector of all 1's,
and $d$ is the dangling index vector whose nonzero elements are $d_i = 1$ iff $\deg(i) = 0$.

The Google matrix $G$ is $G = \alpha S + (1-\alpha)\, v e^T$. For a random web surfer about to visit his next
page, the relaxation parameter $\alpha$ is the probability of choosing a link-accessible page. In choosing
otherwise, i.e., with probability $1 - \alpha$, from the complete Web page set, vector $v$ contains the respective
conditional probabilities of such teleportations. Typically $v = w$ and $\alpha = 0.85$.
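
The four matrices can be assembled for a toy Web in a few lines. The sketch below is only an illustration of the definitions above, assuming $v = w$ and a dense list-of-lists representation; the function name `google_matrix` and the 4-page adjacency matrix are made up for this example and are not part of the authors' code.

```python
def google_matrix(A, alpha=0.85):
    """Dense Google matrix G = alpha*S + (1-alpha)*v*e^T for a 0/1 adjacency matrix A.

    Assumptions for this sketch: v = w = e/n, and A[i][j] == 1 iff page i links to page j.
    """
    n = len(A)
    deg = [sum(row) for row in A]          # outdegree of each page
    w = [1.0 / n] * n                      # uniform vector w = e/n
    v = w                                  # teleportation vector (typical choice)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):                     # row of G: target page
        for j in range(n):                 # column of G: source page
            if deg[j] > 0:
                s_ij = A[j][i] / deg[j]    # (P^T)_{ij} = P_{ji}
            else:
                s_ij = w[i]                # dangling column j is replaced by w
            G[i][j] = alpha * s_ij + (1.0 - alpha) * v[i]
    return G


# Hypothetical 4-page Web; page 3 has no out-links (dangling page).
A = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 0, 0]]
G = google_matrix(A)
column_sums = [sum(G[i][j] for i in range(4)) for j in range(4)]
```

Every column of `G` sums to one, which is the sense in which $G$ is stochastic in the equation $x = Gx$ used next.
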
The PageRank vector x is the solution of the linear system 

$$x = G x, \qquad (1)$$

where the matrix $G$ is an irreducible stochastic matrix, and thus its largest eigenvalue in magnitude
is $\lambda_{\max} = 1$ [?]. Thus, the PageRank vector $x$ is the eigenvector corresponding to $\lambda_{\max} = 1$, and
when normalized, it is the reachability probability in a random walk on the Web, i.e., the invariant
measure or stationary probability distribution of a Markov process modeled by the matrix $G$. It can
also be computed as the solution to a system of linear equations. Using the fact that $x$ is normalized
to unity, i.e., $e^T x = 1$, equation (1) yields

$$(I - R)\, x = b \qquad (2)$$

where $b = (1-\alpha)\, v$ and $R = \alpha S$ is the relaxed stochastic matrix [13].
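
The step from the eigenvector equation to the linear system can be written out explicitly. It uses only the definition $G = \alpha S + (1-\alpha) v e^T$ and the normalization $e^T x = 1$:

```latex
\begin{aligned}
x &= G x = \alpha S x + (1-\alpha)\, v\, (e^{T} x) \\
  &= R x + (1-\alpha)\, v, \qquad R = \alpha S,\quad e^{T} x = 1, \\
\Longrightarrow\quad (I - R)\, x &= b, \qquad b = (1-\alpha)\, v .
\end{aligned}
```

Since $S$ is column-stochastic, $\|R\|_1 = \alpha < 1$, so $I - R$ is nonsingular and the system has a unique solution.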



3. Synchronous PageRank 

To make the computation of $x$ practical for the problem sizes we are considering, it is necessary
to employ an iterative method, e.g., executing until convergence

$$x(t+1) = f(x(t)) \qquad (3)$$

with $t = 0, 1, \ldots$ for a suitable operator $f$ and some initial vector $x(0)$. The vector $x(t)$ denotes the
approximation to $x$ obtained after $t$ iterations. The above process needs to be mapped on a specific
execution environment, corresponding to a computational model that typically preserves the
semantics of the mathematical model in (3). The environment constitutes a virtual machine for the
computation and is largely characterized by the types of units of execution (UE) (e.g., processes,
threads) and communication mechanisms (e.g., shared memory, message passing) it readily supports,
especially in hardware. Execution and communication entities are ultimately hosted by actual
machines typically attached to nets (e.g., clusters) and internets (e.g., the Internet).

In the single UE case the aforementioned mapping on the execution environment is straightforward.
For multiple UEs, however, this requires care: In the shared memory case, a semantics-preserving
mapping must involve synchronized access to shared memory cells between cooperating
UEs, protected by locks, whereas in the message passing case, this synchronization is achieved
through a barrier mechanism implemented atop collective blocking communication. For the PageRank
computation, we can easily turn (3) into the following simple iteration:

$$x(t+1) = G\, x(t), \quad x(0) \text{ given}. \qquad (4)$$

That is, $f(\cdot)$ amounts to a matrix-vector multiplication. This is the well-known power method for
finding the eigenvector of $G$ corresponding to the eigenvalue of largest magnitude [26], except that
no per-step normalization needs to be performed. The normalization is not needed since a stochastic
matrix such as $G$ does not alter $\|x(t)\|_1$ for nonnegative $x(t)$, and thus no danger of overflow or underflow is present here.
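
A minimal single-UE sketch of the normalization-free power method follows. It never forms $G$; instead it applies $\alpha (P^T x + w\, d^T x) + (1-\alpha) v\, e^T x$ directly from out-link lists. The function name, the stopping rule on the 1-norm of the update, and the toy graph are assumptions made for illustration, with $v = w$.

```python
def pagerank_power(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Iterate x(t+1) = G x(t) without forming G or renormalizing (assumes v = w = e/n)."""
    n = len(out_links)
    x = [1.0 / n] * n                       # x(0): uniform, so e^T x(0) = 1
    for it in range(1, max_iter + 1):
        y = [0.0] * n
        dangling_mass = 0.0
        for j, targets in enumerate(out_links):
            if targets:
                share = x[j] / len(targets)  # column j of P^T spreads x_j evenly
                for i in targets:
                    y[i] += share
            else:
                dangling_mass += x[j]        # d^T x, redistributed through w
        total = sum(x)                       # e^T x, stays 1 in exact arithmetic
        y = [alpha * (yi + dangling_mass / n) + (1.0 - alpha) * total / n for yi in y]
        diff = sum(abs(a - b) for a, b in zip(x, y))
        x = y
        if diff < tol:
            break
    return x, it


# Hypothetical 4-page Web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0, 3}, page 3 dangling.
x, iterations = pagerank_power([[1, 2], [2], [0, 3], []])
```

The sum of the entries of `x` remains one throughout (up to roundoff), which is the property that makes the per-step normalization unnecessary.
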

Single UE implementations of (4) with an emphasis on convergence acceleration, support for
personalization through different teleportation vectors and utilization of naturally occurring block
structure in the adjacency matrix $A$ can be found in [17, 18, 19]. For multiple UEs, message passing
computation of PageRank using the formulation (2) was presented in [16].

4. Asynchronous PageRank

Unfortunately, the necessary per-step synchronization of the synchronous algorithm described
above grows into a significant overhead, especially as it is governed by the rate of the slowest UE
and the costs of lock or barrier management. One radical transformation to address this problem is
to reduce the requirement for synchronization, e.g., by using non-blocking access to shared memory
cells or network buffers. A central theme of our work is to investigate the effect of this transformation
on the convergence, speed and overall effectiveness of the computations.

For an environment with $p$ UEs, denote by $\{i\}$ the set of indices assigned to the $i$-th UE during the
iterative computation, by $T^i$ the set of times at which $x_{\{i\}}$ is updated (i.e., the $i$-th UE finishes its
computation) and by $\tau^i_j(t)$ the time when the fragment $x_{\{j\}}$, which is available at time $t$ in the $i$-th UE,
was actually produced at its respective UE. Then for $t \in T^i$, the $i$-th UE updates

$$x_{\{i\}}(t+1) = f_i\big(x_{\{1\}}(\tau^i_1(t)), \ldots, x_{\{p\}}(\tau^i_p(t))\big), \qquad (5)$$

while $x_{\{i\}}(t+1) = x_{\{i\}}(t)$ at other times. Delays due to omission of synchronization phases are
expressed as differences $t - \tau^i_j(t) \ge 0$. The relation (5) is the asynchronous analog of (3), where
$f_i$ expresses the distributed operator component executing at the $i$-th UE. Obviously the form of $f_i$
is independent of the asynchronism introduced. It thus follows that the normalization-free power
method for PageRank computation at the $i$-th UE reads

$$x_{\{i\}}(t+1) = G_i \big[x^T_{\{1\}}(\tau^i_1(t)), \ldots, x^T_{\{p\}}(\tau^i_p(t))\big]^T \qquad (6)$$

for $t \in T^i$, and $x_{\{i\}}(t+1) = x_{\{i\}}(t)$ at other times, where $G_i$ is the set of rows of the Google matrix $G$
indexed by $\{i\}$. Alternatively, while the synchronous, linear system equation approach would lead
to an iterative scheme of the form $x(t+1) = R\, x(t) + b$, which can be seen to be identical to (4), its
asynchronous formulation would lead to another, slightly different computational kernel, namely

$$x_{\{i\}}(t+1) = R_i \big[x^T_{\{1\}}(\tau^i_1(t)), \ldots, x^T_{\{p\}}(\tau^i_p(t))\big]^T + b_i \qquad (7)$$

for $t \in T^i$, and $x_{\{i\}}(t+1) = x_{\{i\}}(t)$ at other times, at the $i$-th UE. Here $R_i$ is the set of rows of the relaxed
stochastic matrix $R$ indexed by $\{i\}$, and $b_i$ is the corresponding set of elements of vector $b$.
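
The asynchronous kernel based on $R$ and $b$ can be mimicked on a single machine by letting each UE read stale copies of the other fragments. The sketch below is a sequential simulation under stated assumptions, not the distributed implementation discussed later: delays are drawn at random up to `max_delay`, each UE finishes a step with probability 1/2 at every tick, blocks hold `ceil(n/p)` consecutive rows, and $v = w$. All names are hypothetical.

```python
import random


def async_pagerank(out_links, p=2, alpha=0.85, max_delay=3, steps=3000, seed=0):
    """Simulate x_{i}(t+1) = R_i [stale fragments] + b_i with bounded random delays."""
    rng = random.Random(seed)
    n = len(out_links)
    in_links = [[] for _ in range(n)]        # rows of P^T, stored as in-link lists
    for j, targets in enumerate(out_links):
        for i in targets:
            in_links[i].append(j)
    dangling = [j for j in range(n) if not out_links[j]]
    size = -(-n // p)                        # ceil(n / p) rows per UE (assumption)
    blocks = [range(k * size, min((k + 1) * size, n)) for k in range(p)]
    owner = [k for k, blk in enumerate(blocks) for _ in blk]
    history = [[1.0 / n] * n]                # history[t] is the global iterate at time t
    for t in range(steps):
        new = list(history[-1])              # fragments not updated keep their value
        for k, blk in enumerate(blocks):
            if rng.random() < 0.5:           # t is not in T^k: UE k is still busy
                continue
            # tau[j]: production time of the fragment of UE j as seen by UE k
            tau = [t if j == k else max(0, t - rng.randint(0, max_delay))
                   for j in range(p)]
            seen = [history[tau[owner[j]]][j] for j in range(n)]
            dmass = sum(seen[j] for j in dangling) / n
            for i in blk:
                s = sum(seen[j] / len(out_links[j]) for j in in_links[i])
                new[i] = alpha * (s + dmass) + (1.0 - alpha) / n   # (R x)_i + b_i
        history.append(new)
    return history[-1]


# Same hypothetical 4-page Web as a synchronous reference would use.
x_async = async_pagerank([[1, 2], [2], [0, 3], []])
```

Because the fixed point of $x = Rx + b$ is unique, the simulated iterate approaches the same vector as the synchronous iteration despite the stale reads; no renormalization is applied at the end.
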

Also of interest are P2P computations of PageRank [12, ?, 25, 29]. These fall into the multiple
UEs, message passing category and are asynchronous in nature. An important novelty in these
studies is the dynamically generated link information through a notification protocol proposed to be
integrated with the host Web servers.

The lack of synchronization annuls the semantics of the original mathematical algorithm. Therefore,
it becomes necessary to discuss the convergence properties of the asynchronous scheme (5).
We discuss this and related issues in the remainder of this section.

4.1. Convergence 

Convergence of asynchronous iterative algorithms is usually established through constructing a
sequence of nested boxed sets in the spirit of the following theorem [9]:

Theorem 1 Let $\{X(k)\}$: $\ldots \subset X(k+1) \subset X(k) \subset \ldots \subset X$, with the following two conditions.
Synchronous Convergence Condition: For all $k = 1, \ldots$, $x \in X(k)$, $f(x) \in X(k+1)$, and for $\{y^k\}$, $y^k \in
X(k)$: the limit points of $\{y^k\}$ are fixed points of $f$.
Box Condition: For all $k = 1, \ldots$, $X(k) = X_1(k) \times \ldots \times X_p(k)$.

Then if $x(0) \in X(0)$, the limit points of $\{x(t)\}$ are fixed points of $f$, where $\{x(t)\}$ are given by (5).

Process (6) involves a nonnegative matrix of unit spectral radius; it is proved in [21] that the
corresponding asynchronous iteration converges to the true solution within a multiplicative factor
that can easily be factored out in the end by renormalization. A discussion on the misconception
by some authors that for a nonnegative matrix $B$, spectral radius $\rho(B) < 1$ is a necessary condition
for convergence of an asynchronous normalization-free power method can be found in [23]. On
the other hand, process (7) involves a matrix $R$ with $\rho(R) < 1$. Asynchronous iterations with such
matrices are well known to converge to the true solution [9].
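
For the iteration with the matrix $R$, nested box sets of the kind required by Theorem 1 can be written down explicitly. The following is a sketch of the standard weighted max-norm argument for a nonnegative $R$ with $\rho(R) < 1$, not a reproduction of the proof in the cited reference; the weight vector $u$, the factor $\gamma$ and the constant $c$ are notation introduced here, and $x$ denotes the solution of $(I - R)x = b$.

```latex
\begin{aligned}
&u := (I-R)^{-1} e = \sum_{k \ge 0} R^{k} e \;\ge\; e > 0,
  \qquad R\,u = u - e \le \gamma\, u, \quad \gamma := 1 - \frac{1}{\max_i u_i} < 1, \\
&\|y\|_{u} := \max_i \frac{|y_i|}{u_i}
  \quad\Longrightarrow\quad \|R\,y\|_{u} \le \gamma\, \|y\|_{u}, \\
&X(k) := \{\, y : \|y - x\|_{u} \le \gamma^{k} c \,\}
  = \prod_{i=1}^{n} \big[\, x_i - \gamma^{k} c\, u_i,\; x_i + \gamma^{k} c\, u_i \,\big],
  \qquad c := \|x(0) - x\|_{u}.
\end{aligned}
```

Each $X(k)$ is a product of intervals, so grouping coordinates by UE gives the Box Condition, and $f(y) - x = R(y - x)$ for $f(y) = Ry + b$ shows that $f$ maps $X(k)$ into $X(k+1)$.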



computing UE:

    if (checkConvergence())
        if (not converged)
            converged = true
        pc++
        if (pc == pcMax)
            send(CONVERGE, monitor)
            recv(STOP, monitor)
    else
        if (converged)
            converged = false
            send(DIVERGE, monitor)
        pc = 0

monitor UE:

    recv(CONVERGE | DIVERGE, all)
    if (checkConvergence())
        if (not converged)
            converged = true
        pc++
        if (pc == pcMax)
            send(STOP, all)
    else
        if (converged)
            converged = false
        pc = 0

Figure 1. pc: persistence counter, pcMax: its max value; reaching it triggers CONVERGE/STOP messages.
They can have different values in monitor, computing UEs; all: all computing UEs



4.2. Termination Detection 

The termination of asynchronous iterative algorithms is a non-trivial matter since local convergence
at a UE does not automatically ensure global convergence. Even in the extreme case when
all UEs have locally converged, one can devise scenarios where messages not yet delivered could
destroy local convergence.

Both centralized and distributed protocols for termination detection can be found in the literature
[?, 24]. In a centralized approach, a special UE acts as a monitor of the convergence process of
the other computing UEs; it keeps a log of the convergence status and issues STOP messages to all
computing UEs when all of them have signaled their local convergence. In fact, computing UEs
can issue either CONVERGE (when achieving local convergence) or DIVERGE (when exiting such a
state) messages to the monitor UE. Distributed protocols for global convergence detection (see, e.g.,
[28]) are flexible but rather complex to implement. They typically assume a specific underlying
communication topology. For example, in [6] a leader election protocol is used, which in turn
assumes a tree topology.

Our draft version of a practical centralized protocol, in part inspired by [?], is presented in
Figure 1. It enforces persistence of convergence both at the computing UEs (for issuing a CONVERGE
message) and at the monitor UE (for issuing a STOP message). Persistence is introduced to provide
time for pending (and perhaps divergence-causing) messages to be actually delivered.
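
The persistence logic can be sketched as two small state machines with the message transport left out. This is one possible reading of the centralized protocol, written for illustration only: the class names, the method names and the use of plain strings for CONVERGE/DIVERGE are assumptions, and the monitor's convergence check is taken to mean "every computing UE currently reports CONVERGE".

```python
class ComputingUE:
    """Local side: counts consecutive locally converged iterations."""

    def __init__(self, pc_max):
        self.pc_max = pc_max
        self.pc = 0                    # persistence counter
        self.converged = False

    def step(self, locally_converged):
        """Return the message to send to the monitor after one iteration, or None."""
        if locally_converged:
            self.converged = True
            self.pc += 1
            return "CONVERGE" if self.pc == self.pc_max else None
        was_converged = self.converged
        self.converged = False
        self.pc = 0                    # any non-converged iteration resets persistence
        return "DIVERGE" if was_converged else None


class Monitor:
    """Central side: issues STOP after all UEs stay converged for pc_max polls."""

    def __init__(self, n_ues, pc_max):
        self.status = [False] * n_ues  # last reported state of each computing UE
        self.pc_max = pc_max
        self.pc = 0

    def on_message(self, ue_id, kind):
        self.status[ue_id] = (kind == "CONVERGE")

    def poll(self):
        """Return True when STOP should be sent to all computing UEs."""
        if all(self.status):
            self.pc += 1
            return self.pc >= self.pc_max
        self.pc = 0
        return False


ue = ComputingUE(pc_max=2)
messages = [ue.step(c) for c in [True, True, True, False, True]]

monitor = Monitor(n_ues=2, pc_max=2)
monitor.on_message(0, "CONVERGE")
first = monitor.poll()                 # UE 1 has not reported yet
monitor.on_message(1, "CONVERGE")
second = monitor.poll()                # all converged, persistence count 1
third = monitor.poll()                 # persistence count 2: STOP
```

With `pc_max = 1` on both sides, as in the experiments reported later, the first consistent observation already triggers the corresponding message.
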

4.3. Implementation

We focus on multiple-UE message passing environments, which is the case for our experiments.
In that case, we need non-blocking communication primitives. These are actually implemented
either by using multithreading (e.g., one thread per communication channel) or by multiplexing
such channels and probing from within a single thread (through select()-type mechanisms) for
new data. Since multiple messages might have been received in the meantime, messages should be
kept in queues organized under a common discipline.
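
One simple queue discipline is sketched below: a FIFO inbox per channel that the computation thread drains without blocking, keeping only the newest fragment. This "latest message wins" policy is an assumption chosen for illustration, not a discipline prescribed by the text; `drain_latest` is a hypothetical helper built on the standard `queue` module.

```python
import queue


def drain_latest(inbox):
    """Non-blocking receive: return the newest queued message, or None if empty."""
    latest = None
    while True:
        try:
            latest = inbox.get_nowait()   # raises queue.Empty instead of blocking
        except queue.Empty:
            return latest


inbox = queue.Queue()                      # filled by a receiver thread in practice
for fragment in ("x(1)", "x(2)", "x(3)"):
    inbox.put(fragment)
newest = drain_latest(inbox)               # older fragments are discarded
nothing = drain_latest(inbox)              # queue is now empty
```

Discarding superseded fragments keeps the queue short when messages arrive faster than iterations complete.
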

5. Numerical Experiments 

5.1. Application Structure 

Our application consists of scripts steering Java classes. These scripts are written in Jython [2],
which is an implementation of the Python [4] programming language in Java [1]. Such a mixed-language
approach facilitates writing portable, interactive, easily extensible and flexible systems;
after all, performance critical operations can always be isolated into compiled .class code.

Scripts build and use objects. Configuration objects, which are accessible from all other objects,
can load/store parameters from/to configuration files, partition and distribute matrix or vector data and
optionally send code or launch processes over the cluster nodes. Computation objects perform
computations and exchange information related to convergence status with Monitor objects
implementing the termination detection protocol; cf. Figure 1. Communications are established through
Communication objects which set up suitable communicators upon their instantiation; these
communicators expose communication primitives to be invoked at each step.

We use multithreading in order to implement non-blocking communications. An asynchronous
send() or recv() is just its blocking counterpart wrapped in a thread object and submitted to a
thread pool endowed with a suitable task-handling strategy. Data are imported/exported through
read/write channels with locks synchronizing those concurrently executing threads which happen
to be managing messages with identical source and target IDs. Access to thread pool queues and
pending communication-task-handles is provided so that a customized thread-management policy
can be applied. At startup, a single file containing computation parameters should be available.
This file is used by a Configuration object for the generation of node-specific configuration files
and a script for distributing these files (optionally with other data or updated source code files) to
the cluster nodes and initiating the computation. An option for automatic report generation is also
provided.
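
The idea of wrapping a blocking primitive and submitting it to a thread pool can be sketched with Python's standard `concurrent.futures` module. This is an analogy to the Java/Jython design described above, not the authors' code: `blocking_send` is a hypothetical stand-in that sleeps and appends to a list instead of writing to a network channel.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_send(channel, payload, delay=0.0):
    """Hypothetical blocking send: waits `delay` seconds, then delivers the payload."""
    time.sleep(delay)
    channel.append(payload)
    return len(channel)


pool = ThreadPoolExecutor(max_workers=4)


def async_send(channel, payload, delay=0.0):
    """Non-blocking send: submit the blocking call and return its task handle."""
    return pool.submit(blocking_send, channel, payload, delay)


channel = []
handle = async_send(channel, "fragment-0", delay=0.05)
local_iterations = 0
while not handle.done():        # the computation thread keeps iterating meanwhile
    local_iterations += 1
    time.sleep(0.001)
delivered = handle.result()     # the send has completed; returns the channel length
pool.shutdown()
```

The returned handle plays the role of the pending communication-task-handle: the caller can poll it, wait on it, or attempt to cancel it according to its own thread-management policy.
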

5.2. Numerical Results 

We used a Beowulf cluster of Pentium-class machines at 900 MHz, with 256 MB RAM each, running
Linux, version 2.4, and connected to a 10 Mbps Ethernet LAN. We used Java 5.0, Jython 2.1 and
Matrix Toolkits for Java [3] for composing our scripts and classes, all freely available on the Web.
We report on some of the results of this ongoing work. The transition matrix used in the experiments
is the Stanford-Web matrix [13], generated from an actual web crawl. It contains connectivity info
for 281,903 pages (2,312,497 non-zero elements, 172 dangling nodes). We used the computational
kernel (?) with a local convergence threshold of 10^^. Note that in each case, blocks of consecutive
[n/p] rows were distributed among computing machines. Termination detection used pcMax =
1 on both monitor and computing UEs. Configurations with 2, 4, and 6 machines were tested for
both synchronous and asynchronous computations. Results in Table 1 are encouraging.

Table 1

Numerical results: For the asynchronous case iteration ranges, computation time ranges are given
([max, min] values) since local convergence threshold is not 'simultaneously' reached at all nodes.
A column with the average speedup offered by asynchronous computation over synchronous one is
given (averages are over extreme values in the asynchronous case).

On the other hand, it is fair to note that these results correspond to reaching the local convergence
threshold. Assembling vector fragments resulting from asynchronous computations at the monitor UE
and then checking global convergence reveals that a threshold of the order of 5 x 10^ has actually
been reached. Preliminary results of timing with respect to reaching a common global threshold
(instead of a local one) reveal a modest speedup of asynchronous vs. synchronous computation in the
10-20% range. Responsibility for the degradation of performance when increasing the number of UEs
appears to lie with the overall large communication-to-computation ratio of the current algorithm.
Observe, however, that what is important is not the accurate values of the PageRank vector
components, but their relative ranking. Therefore, an issue in our present investigations is the
effect of a more relaxed global threshold criterion on the computed page ranks.
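
Whether a looser threshold changes the ordering of pages can be checked with a few lines of code. The sketch below compares the orderings induced by two score vectors; the helper names and the two 4-entry vectors are purely illustrative and are not data from the experiments.

```python
def ranking(x):
    """Page indices sorted by decreasing score (ties broken by page index)."""
    return sorted(range(len(x)), key=lambda i: (-x[i], i))


def top_k_overlap(x, y, k):
    """Fraction of pages shared by the top-k sets of two score vectors."""
    return len(set(ranking(x)[:k]) & set(ranking(y)[:k])) / k


# Made-up vectors: a tightly converged result and a loosely converged one.
accurate = [0.2340, 0.1867, 0.3453, 0.2340]
loose = [0.2301, 0.1902, 0.3500, 0.2297]
same_order = ranking(accurate) == ranking(loose)
overlap = top_k_overlap(accurate, loose, 2)
```

Here the component values differ in the third decimal place while the induced ranking is unchanged, which is the situation a relaxed global threshold is meant to exploit.
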

Asynchronous iterative algorithms also seem to naturally adapt to heavy communication demands
in a computation; current send()/receive() threads can block but the computation thread is free to
advance to the next iteration step. On the contrary, in synchronous mode, no option exists except for
blocking all threads (even the computation one), until data emitted from all nodes actually reach
their destinations and synchronization completes, no matter whether the supporting network's
characteristics suffice. In this case asynchronous computation can exhibit a low message import ratio
(always with respect to iteration count which is obviously increased relative to synchronous setting);
see Table 2.

Table 2

Completed imports for the 4 computing UEs, asynchronous case. Rows contain the number of
different vector fragments actually received during the computation from peers with respective IDs.
Diagonal numerical entries contain the total number of locally computed and thus locally used vector
fragments. Completed Imports column contains percentage averages of imports actually completed
(should all be 100% for the synchronous case).

6. Conclusions and Future Work

The major performance bottleneck in our experiments to date is due to the large volume of data
and the frequency at which it is produced. The latter is caused by the small computation time
per iteration (sparse matrix-vector multiplication). Note also that the communication pattern is an
all-to-all scheme at each step; all these factors conspire to surpass the available network bandwidth
and thus build memory-consuming buffers of pending messages at the sending ends.

In the case of asynchronous iterations, data is being produced at a rate that is even higher than in
the synchronous case, because part of the time gained from eliminating the synchronization phases is
actually used for the production of extra messages; these (favorably) advance local iteration counters
but they could also (unfortunately) overload the network; we guard against this misfortune by
cancelling send()/recv() threads not having completed within a time window. The following is thus
a hardly surprising conclusion from our experiments: Asynchronous iterative algorithms make up an
alternative computation methodology in distributed environments. However, this is not a black-box
methodology and is most effectively utilized by iterative methods with a heavy computational
component and light communication. Frequent, all-to-all, fat message passing can saturate network
infrastructure capacity, even in modest but dedicated cluster environments; heterogeneous
environments like the Grid would be even more sensitive to such message passing scenarios. We would thus
like to avoid the use of all-to-all communication schemes; after all, the flexibility of asynchronous
iterations gives us a choice on the targets of produced messages. Furthermore, it is advisable to
employ an adaptive communication scheme; if message sending/receiving tasks fail to complete within
a number of local iterations, reduce the rate of message exchanges with the poorly 'responding'
node.
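
One way such an adaptive scheme might look is sketched below: each peer has a send interval, measured in local iterations, that grows when a transfer to that peer misses its window and shrinks again when transfers complete in time. The doubling/halving policy, the cap `max_interval` and the class name are assumptions made for illustration; the text above does not prescribe a particular rule.

```python
class AdaptivePeerSchedule:
    """Per-peer send interval (in local iterations) that backs off on slow peers."""

    def __init__(self, peers, max_interval=64):
        self.interval = {peer: 1 for peer in peers}   # 1 = send at every iteration
        self.max_interval = max_interval

    def should_send(self, peer, iteration):
        """Send to `peer` only on iterations that are multiples of its interval."""
        return iteration % self.interval[peer] == 0

    def report(self, peer, completed_in_time):
        """Update the interval after a send/receive task finished or timed out."""
        if completed_in_time:
            self.interval[peer] = max(1, self.interval[peer] // 2)
        else:
            self.interval[peer] = min(self.max_interval, self.interval[peer] * 2)


schedule = AdaptivePeerSchedule(["node-a", "node-b"])
schedule.report("node-b", completed_in_time=False)    # first missed window
schedule.report("node-b", completed_in_time=False)    # second missed window
slow_interval = schedule.interval["node-b"]
decisions = [schedule.should_send("node-b", it) for it in range(4, 8)]
```

A responsive peer keeps receiving a fragment at every iteration, while traffic to the slow peer drops to every fourth iteration until it recovers.
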

In our ongoing work, we explore adaptive schemes for the asynchronous computation of PageRank.
We also experiment with select()-based implementations of asynchronism in order to amortize
thread management costs. Since trees are naturally occurring internetwork topologies we also
plan to study the performance of moving a clique-based (i.e., all-to-all) synchronous iterative method
to an asynchronous, tree-based counterpart. We are also considering the use of suitable permutations
(cf. [?]) as well as larger data sets.